RED WINE by HANDE KESKIN SUNGUR

I explore the relationship between the chemical properties and the quality of wine using the red wine data. The report covers univariate, bivariate, and multivariate analyses, followed by a final summary and reflection.

Summary of the Data Set

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

Examine histogram graphs of all values:

Quality of Red Wine

The median quality is 6 and the mean is 5.636. Quality ratings in the sample range from 3 to 8. Most wines are rated either 5 or 6, and 5 is the most common rating.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
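The table and summary above can be reproduced from the printed counts alone. The following is a minimal sketch in base R; the vector name `quality` is an illustrative stand-in for the dataset's quality column, not necessarily the code used to produce this report.

```r
# Rebuild the quality column from the counts printed above
# (10 wines rated 3, 53 rated 4, 681 rated 5, and so on).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

quality_counts  <- table(quality)    # frequency of each rating
quality_summary <- summary(quality)  # min, quartiles, median, mean, max

print(quality_counts)
print(quality_summary)

# The most frequent rating (the mode), read off the frequency table
most_common <- as.integer(names(quality_counts)[which.max(quality_counts)])
print(most_common)
```

Because the vector is rebuilt from the counts, the order of the rows is lost, but the frequencies, the median, and the mean are unaffected.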

Draw a histogram of the quality ratings:

We create a new variable called “rating”, which divides quality into three categories: “bad”, “average”, and “excellent”.

quality < 5 = ‘bad’; 5 <= quality < 7 = ‘average’; quality > 6 = ‘excellent’
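One way to express this rule in R is with `cut()`. This is a sketch under the assumption that quality is an integer score; the report may have built the variable differently (for example with nested `ifelse()` calls), and the `quality` vector here is rebuilt from the counts printed earlier.

```r
# Quality scores rebuilt from the frequency table shown in this report
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed by default:
#   (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf) -> excellent
rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "average", "excellent")
)

rating_counts <- table(rating)
print(rating_counts)
```

With `cut()` the factor levels follow the order given in `labels`, which keeps "bad", "average", and "excellent" in a natural order on plot axes.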

Level of Alcohol in Red Wine

The median alcohol content is 10.20 and the mean is 10.42. Alcohol content in the sample ranges from 8.40 to 14.90. The red wine data sample is small, but its alcohol level distribution shows the pattern expected for red wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Level of Chlorides and Residual sugar in Red Wine

The distributions of residual sugar and chlorides are long-tailed, so I transformed both variables. On a log10 scale, both distributions are easier to read.
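The effect of the log10 transform can be checked with the summary statistics printed earlier, without the raw data. The sketch below compares how far the maximum and the minimum sit from the median, before and after the transform. The helper `tail_ratio` is a hypothetical name introduced only for this illustration.

```r
# Minimum, median, and maximum taken from the summary output in this report
five_num <- list(
  residual.sugar = c(min = 0.9,   median = 2.2,   max = 15.5),
  chlorides      = c(min = 0.012, median = 0.079, max = 0.611)
)

# Length of the upper tail divided by the length of the lower tail.
# A value far above 1 indicates a long right tail.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- sapply(five_num, tail_ratio)
log_ratio <- sapply(five_num, function(x) tail_ratio(log10(x)))

print(round(rbind(raw = raw_ratio, log10 = log_ratio), 2))
```

A ratio closer to 1 on the log10 scale means the two tails are more balanced, which is why the transformed histograms are easier to read.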

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 red wine observations with 13 variables. Eleven of the variables are quantitative chemical features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The remaining two are X, a row index, and quality.

The final variable, quality, scores each wine on a scale from 0 to 10, but the scores observed in this dataset range only from 3 to 8. All of the chemical features have a minimum value greater than 0 except citric acid, whose minimum is 0.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. I would like to determine which features are best for predicting the quality of a wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Alcohol, fixed acidity, and residual sugar likely contribute to the quality of a wine.

Did you create any new variables from existing variables in the dataset?

I converted the numeric quality score into discrete ranges. I created a new variable called “rating”, which divides quality into “bad”, “average”, and “excellent”. This grouping makes it easier to detect differences among the groups.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The residual sugar and chlorides histograms did not look normal; both were long-tailed. I applied a log10 transform to the x-axis.

Bivariate Plots Section

The top 4 correlation coefficients with quality are:

alcohol-quality = 0.48

sulphates-quality = 0.26

citric.acid-quality = 0.22

fixed.acidity-quality = 0.12

Of all the variables, alcohol content has the strongest correlation with red wine quality.

The largest negative correlation coefficients with quality are:

volatile.acidity-quality = -0.39

total.sulfur.dioxide-quality = -0.19

density-quality = -0.17

chlorides-quality = -0.13

Variables with the highest positive correlation include:

fixed.acidity-citric.acid = 0.67

fixed.acidity-density = 0.67

free.sulfur.dioxide-total.sulfur.dioxide = 0.67

alcohol-quality = 0.48

sulphates-chlorides = 0.37

Variables with the strongest negative correlation include:

fixed.acidity-pH = -0.68

volatile.acidity-citric.acid = -0.55

citric.acid-pH = -0.54

density-alcohol = -0.50

volatile.acidity-quality = -0.39
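Rankings like the ones above can be obtained from a correlation matrix. The sketch below uses only the first ten rows of the data, as printed in the structure output at the top of this report, so its coefficients will not match the full-data values listed above. It illustrates the technique only, and the name `wine_head` is an assumption made for the example.

```r
# First ten observations, copied from the str() output in this report
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(wine_head)

# Keep the column of correlations with quality, drop quality itself,
# and sort from most positive to most negative
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
ranked <- sort(quality_cor, decreasing = TRUE)

print(round(ranked, 2))
```

Applied to the full data frame, the same steps would give the ranking by correlation with quality; the pairwise lists come from the off-diagonal entries of `cor_matrix`.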

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Volatile acidity has the largest negative correlation with quality, and alcohol has the largest positive correlation. In the plots, quality increases moderately with higher alcohol, and it decreases as volatile acidity increases. The summaries below show alcohol first and then volatile acidity for each rating group.

## wineQualityReds$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## wineQualityReds$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## wineQualityReds$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

## wineQualityReds$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## wineQualityReds$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5650  0.6800  0.7242  0.8825  1.5800 
## -------------------------------------------------------- 
## wineQualityReds$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150
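Grouped summaries in this format are what `by()` prints when a summary function is applied per rating group. The sketch below applies it to the first ten rows shown in the structure output; those rows contain no "bad" wines, so only two groups appear. The data frame name `wine_head` and the nested `ifelse()` construction are assumptions for the example.

```r
# First ten observations, copied from the str() output in this report
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Apply the rating rule: below 5 is bad, 5 or 6 is average, above 6 is excellent
wine_head$rating <- ifelse(wine_head$quality < 5, "bad",
                    ifelse(wine_head$quality < 7, "average", "excellent"))

# Summary statistics of alcohol within each rating group
alcohol_by_rating <- by(wine_head$alcohol, wine_head$rating, summary)
print(alcohol_by_rating)

# The same grouping with a single statistic per group
mean_alcohol <- tapply(wine_head$alcohol, wine_head$rating, mean)
print(mean_alcohol)
```

Groups are printed in alphabetical order when the grouping variable is a character vector, which matches the average, bad, excellent order seen in the output above.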

We plot pH against fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to drop as fixed acidity increases. This makes sense, since a lower pH means a more acidic wine.

## [1] -0.6829782
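A single coefficient like this one comes from `cor()` on two columns. The sketch below also runs `cor.test()`, which is an addition not used in this report, to show how a confidence interval could accompany the estimate. It uses only the first ten rows from the structure output, so the value differs from the full-data coefficient printed above.

```r
# First ten observations, copied from the str() output in this report
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Pearson correlation coefficient
r <- cor(fixed_acidity, pH)
print(r)

# Same estimate, with a test of zero correlation and a confidence interval
ph_test <- cor.test(fixed_acidity, pH)
print(ph_test$estimate)
print(ph_test$conf.int)
```

With only ten rows the interval is wide; on all 1599 observations it would be much narrower.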

Sulphate content is quite important for red wine quality, particularly for the highest-quality wines, those rated excellent.

## $average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3700  0.5400  0.6100  0.6473  0.7000  1.9800 
## 
## $bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4950  0.5600  0.5922  0.6000  2.0000 
## 
## $excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7435  0.8200  1.3600

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I was surprised that the relationship between fixed.acidity and pH is the strongest relationship.

What was the strongest relationship you found?

From the variables analyzed, the strongest relationship was between fixed.acidity and pH, which had a correlation coefficient of -0.68.

Multivariate Plots Section

When comparing sulphates to alcohol, I noticed that quality increased as sulphates increased.
For excellent wines, alcohol played an important role in determining quality at a given sulphate level.

We can see that higher-quality wines have higher alcohol and lower volatile acidity.

We can see that higher-quality wines have higher sulphates and higher citric acid.

There is no clear evidence that sugar content is what makes wines bad.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

When looking at wine quality levels, we see a positive relationship between alcohol and sulphates. We also see a negative relationship between quality and volatile.acidity.

Were there any interesting or surprising interactions between features?

I am surprised that residual sugar has very little impact on wine quality.


Final Plots and Summary

Plot One

Description One

This plot shows the distribution of wine quality. It shows that the dataset is unbalanced: it has many wines of medium quality, but far fewer bad and excellent wines.
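The imbalance can be quantified from the quality counts reported in the univariate section. A small sketch, grouping the counts by the rating rule used in this report (the variable names are illustrative):

```r
# Quality counts from the frequency table, grouped into the three ratings:
# 3 and 4 are bad, 5 and 6 are average, 7 and 8 are excellent
rating_counts <- c(
  bad       = 10 + 53,
  average   = 681 + 638,
  excellent = 199 + 18
)

# Share of each rating as a percentage of all wines
rating_share <- round(100 * prop.table(rating_counts), 1)
print(rating_share)
```

The average group holds the large majority of the wines, which is worth keeping in mind when comparing group summaries, since the bad and excellent groups rest on far fewer observations.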

Plot Two

Description Two

In general, high-quality wines tend to have higher alcohol and lower volatile acidity. They also tend to have higher sulphate and citric acid content.

Plot Three

Description Three

Wines with low sulphates tended to be rated bad, so low sulphate content appears to contribute to bad wines. Average wines have higher concentrations of sulphates. Excellent wines have both higher alcohol concentrations and higher sulphate concentrations.


Reflection

The red wine dataset contains information on 1599 red wines with different chemical properties. I set out to discover the relationship between these chemical properties and wine quality. Wine quality turns out to be complex, but plots and visuals make it easier to see where to explore further.

The four features with the strongest correlations with quality are alcohol, volatile acidity, sulphates, and citric acid. Alcohol content appeared to be the number one factor for determining an excellent wine. Additionally, excellent red wines tend to contain higher amounts of citric acid and sulphates. Volatile acidity has a negative correlation with wine quality, and I was surprised that residual sugar has very little impact on wine quality.

First I examined the individual variables in the dataset, and then I explored different questions and leads as I continued to make observations on the plots. I identified features that are related to the quality of red wine, visualized their relationships, and summarized their statistics. I explored the quality of wines across many variables. Eventually I realised that a good wine is more than a perfect combination of different chemical components.

Very few wines are rated as low or high quality. I could do a better analysis if I had more data on wines at the upper and lower ends of the quality scale. More data would likely improve the accuracy of any prediction models. In this exploratory data analysis of the red wine dataset, I found the biggest challenge was sharing the right amount of information.